Analysis of the Gapminder dataset

In this training, we’ll analyze the gapminder dataset, which contains various demographic and economic metrics for many of the world’s countries over time.

Question 1

We’ll start by reading in the data and inspecting it. The data contains a superfluous column, “X1”, that merely duplicates the row index, so we remove it.

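library(tidyverse)  # read_csv(), dplyr and tidyr verbs, ggplot2
library(plotly)     # ggplotly()
library(ggpubr)     # stat_compare_means()

gapminder <- read_csv("gapminder_clean.csv")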
## Warning: Missing column names filled in: 'X1' [1]
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   continent = col_character()
## )
## i Use `spec()` for the full column specifications.
glimpse(gapminder)
## Rows: 2,607
## Columns: 20
## $ X1                                                        <dbl> 0, 1, 2, ...
## $ `Country Name`                                            <chr> "Afghanis...
## $ Year                                                      <dbl> 1962, 196...
## $ `Agriculture, value added (% of GDP)`                     <dbl> NA, NA, N...
## $ `CO2 emissions (metric tons per capita)`                  <dbl> 0.0737813...
## $ `Domestic credit provided by financial sector (% of GDP)` <dbl> 21.276422...
## $ `Electric power consumption (kWh per capita)`             <dbl> NA, NA, N...
## $ `Energy use (kg of oil equivalent per capita)`            <dbl> NA, NA, N...
## $ `Exports of goods and services (% of GDP)`                <dbl> 4.878051,...
## $ `Fertility rate, total (births per woman)`                <dbl> 7.450, 7....
## $ `GDP growth (annual %)`                                   <dbl> NA, NA, N...
## $ `Imports of goods and services (% of GDP)`                <dbl> 9.349593,...
## $ `Industry, value added (% of GDP)`                        <dbl> NA, NA, N...
## $ `Inflation, GDP deflator (annual %)`                      <dbl> NA, NA, N...
## $ `Life expectancy at birth, total (years)`                 <dbl> 33.21990,...
## $ `Population density (people per sq. km of land area)`     <dbl> 14.312061...
## $ `Services, etc., value added (% of GDP)`                  <dbl> NA, NA, N...
## $ pop                                                       <dbl> 10267083,...
## $ continent                                                 <chr> "Asia", "...
## $ gdpPercap                                                 <dbl> 853.1007,...
gapminder$X1 <- NULL
glimpse(gapminder)
## Rows: 2,607
## Columns: 19
## $ `Country Name`                                            <chr> "Afghanis...
## $ Year                                                      <dbl> 1962, 196...
## $ `Agriculture, value added (% of GDP)`                     <dbl> NA, NA, N...
## $ `CO2 emissions (metric tons per capita)`                  <dbl> 0.0737813...
## $ `Domestic credit provided by financial sector (% of GDP)` <dbl> 21.276422...
## $ `Electric power consumption (kWh per capita)`             <dbl> NA, NA, N...
## $ `Energy use (kg of oil equivalent per capita)`            <dbl> NA, NA, N...
## $ `Exports of goods and services (% of GDP)`                <dbl> 4.878051,...
## $ `Fertility rate, total (births per woman)`                <dbl> 7.450, 7....
## $ `GDP growth (annual %)`                                   <dbl> NA, NA, N...
## $ `Imports of goods and services (% of GDP)`                <dbl> 9.349593,...
## $ `Industry, value added (% of GDP)`                        <dbl> NA, NA, N...
## $ `Inflation, GDP deflator (annual %)`                      <dbl> NA, NA, N...
## $ `Life expectancy at birth, total (years)`                 <dbl> 33.21990,...
## $ `Population density (people per sq. km of land area)`     <dbl> 14.312061...
## $ `Services, etc., value added (% of GDP)`                  <dbl> NA, NA, N...
## $ pop                                                       <dbl> 10267083,...
## $ continent                                                 <chr> "Asia", "...
## $ gdpPercap                                                 <dbl> 853.1007,...
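
Before dropping “X1”, it may be worth confirming that it really is just the zero-based row index. The following is a minimal sketch of such a check in base R; the data frame `toy` and its values are hypothetical stand-ins, not taken from the Gapminder file.

```r
# Hypothetical stand-in for the freshly read data: X1 mimics a zero-based
# row index that was written out as an unnamed first column of the CSV.
toy <- data.frame(
  X1 = c(0, 1, 2),
  Year = c(1962, 1967, 1972),
  value = c(10.5, 11.2, 12.9)
)

# TRUE only if X1 is exactly 0, 1, 2, ... in row order.
is_row_index <- all(toy$X1 == seq_len(nrow(toy)) - 1)

# Drop the column only when the check passes.
if (is_row_index) {
  toy$X1 <- NULL
}

print(is_row_index)
print(names(toy))
```

The same comparison could be run on `gapminder$X1` before the assignment to `NULL`.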

Question 2

We’ll first restrict our analysis to the year 1962 and compare each country’s CO2 emissions with its GDP per capita. Since both quantities span several orders of magnitude, we’ll plot them on logarithmic axes. The plot shows an overall positive relationship between the two metrics.

gapminder %>%
  filter(Year == 1962) %>%
  drop_na(`CO2 emissions (metric tons per capita)`, gdpPercap) %>%
  ggplot(mapping = aes(x = `CO2 emissions (metric tons per capita)`, y = gdpPercap)) +  # Backticks allow column names that contain spaces
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "CO2 emissions vs. GDP per capita in 1962") +
  ylab("GDP per capita") +
  geom_point()

Question 3

To quantify the relationship between CO2 emissions and GDP per capita, we calculate Pearson’s correlation coefficient (R) and its associated p-value. The very small p-value indicates that a positive correlation this strong is extremely unlikely to have arisen by chance.

gapminder_1962 <- gapminder %>%
  filter(Year == 1962) %>%
  drop_na(`CO2 emissions (metric tons per capita)`, gdpPercap)

cor_res <- cor.test(gapminder_1962$`CO2 emissions (metric tons per capita)`, gapminder_1962$gdpPercap, method = "pearson")
cat(paste0("The Pearson R value for the correlation between CO2 emissions and GDP per capita in 1962 is: ", cor_res[["estimate"]], 
           "\nThe associated p-value for that correlation is: ", cor_res[["p.value"]]))
## The Pearson R value for the correlation between CO2 emissions and GDP per capita in 1962 is: 0.926081672501947
## The associated p-value for that correlation is: 1.12867922100394e-46
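
Because the scatter plot uses logarithmic axes while the correlation above is computed on the raw values, one might also want to look at the correlation on the log scale. The sketch below shows how the two could be compared with `cor.test()`; the vectors `co2` and `gdp` are invented toy values, not Gapminder data, so the numbers it prints say nothing about the real dataset.

```r
# Hypothetical toy values, invented for illustration only.
co2 <- c(0.05, 0.2, 1.0, 4.0, 12.0)
gdp <- c(300, 900, 2500, 9000, 20000)

# Pearson correlation on the raw values, as in the analysis above.
raw_res <- cor.test(co2, gdp, method = "pearson")

# Pearson correlation after a log10 transform, matching the plot axes.
log_res <- cor.test(log10(co2), log10(gdp), method = "pearson")

cat("Pearson r (raw values):  ", unname(raw_res$estimate), "\n")
cat("Pearson r (log10 values):", unname(log_res$estimate), "\n")
```

Applied to `gapminder_1962`, the same two calls would show whether the conclusion depends on the scale.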

Question 4

Given the correlation between CO2 emissions and GDP per capita in 1962, we next ask which year in the dataset had the strongest correlation between these two metrics. Grouping the data by year and calculating Pearson’s R for each year shows that the strongest correlation was in 1967.

gapminder %>%
  group_by(Year) %>%
  drop_na(`CO2 emissions (metric tons per capita)`, gdpPercap) %>%
  summarize(correlation = cor(`CO2 emissions (metric tons per capita)`, gdpPercap)) %>%
  arrange(desc(correlation))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 2
##     Year correlation
##    <dbl>       <dbl>
##  1  1967       0.939
##  2  1962       0.926
##  3  1972       0.843
##  4  1982       0.817
##  5  1987       0.810
##  6  1992       0.809
##  7  1997       0.808
##  8  2002       0.801
##  9  1977       0.793
## 10  2007       0.720
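
The group-wise calculation can also be written without dplyr. The following is a rough base R equivalent on a small invented data frame (`toy`, with hypothetical columns `co2` and `gdp`), which splits the rows by year, computes one correlation per group, and picks out the year with the largest value.

```r
# Hypothetical toy data with two years; values are invented for illustration.
toy <- data.frame(
  Year = rep(c(1962, 1967), each = 4),
  co2  = c(0.1, 0.5, 2.0, 8.0, 0.2, 0.6, 2.5, 9.0),
  gdp  = c(400, 1500, 3000, 9000, 500, 1400, 4200, 12000)
)

# One Pearson correlation per year.
by_year <- sapply(split(toy, toy$Year), function(d) cor(d$co2, d$gdp))

# The year whose correlation is largest.
best_year <- as.numeric(names(by_year)[which.max(by_year)])

print(by_year)
print(best_year)
```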

Question 5

We now focus on 1967 and visualize the relationship in more detail. In the interactive plot, European countries tend to sit in the top-right corner, indicating high CO2 emissions and high GDP per capita, which is consistent with most of them being industrialized economies. Conversely, African countries tend to cluster in the low-emissions, low-GDP-per-capita region, consistent with their status as developing nations, and countries in the Americas, with the exception of the USA and Canada, occupy an intermediate position.

gapminder_1967 <- gapminder %>%
  filter(Year == 1967) %>%
  drop_na(`CO2 emissions (metric tons per capita)`, gdpPercap, continent, pop)

plot <- ggplot(data = gapminder_1967, 
               mapping = aes(x = `CO2 emissions (metric tons per capita)`, 
                             y = gdpPercap, color = continent, 
                             size = pop, label = `Country Name`)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  ylab("GDP per capita") +
  labs(title = "CO2 emissions vs. GDP per capita in 1967") +
  theme(legend.title = element_blank())

ggplotly(plot, tooltip = "label")

Question 6

We now turn our attention to mean energy use per continent over the years. Grouping the dataset by year and continent lets us visualize the differences in average energy use between continents. The first two years (1962 and 1967) were omitted because very little data was available for Africa and the Americas at those time points. The bar charts, together with the low ANOVA p-values shown in each panel, indicate a significant difference in mean energy use between continents at every remaining time point.

gapminder %>%
  drop_na(continent, `Energy use (kg of oil equivalent per capita)`) %>%
  filter(Year > 1967) %>%
  group_by(Year, continent) %>%
  mutate(Energy_Use_Mean = mean(`Energy use (kg of oil equivalent per capita)`)) %>%
  mutate(Energy_Use_SEM = sd(`Energy use (kg of oil equivalent per capita)`)/sqrt(n())) %>%
  ggplot(mapping = aes(x = continent, y = `Energy use (kg of oil equivalent per capita)`)) + 
  facet_wrap(~ Year, scales = "free") +
  geom_bar(stat = "summary", fun = "mean") +
  geom_errorbar(aes(ymin = Energy_Use_Mean - Energy_Use_SEM, ymax = Energy_Use_Mean + Energy_Use_SEM)) +
  stat_compare_means(method = "anova", label.x = 2) +
  theme(axis.text.x = element_text(angle = 90))
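
The quantities behind each panel, namely the per-continent mean, the standard error of the mean (sd / sqrt(n)), and a one-way ANOVA p-value, can be reproduced for a single year with base R. The sketch below uses an invented data frame `toy`; it is an assumption that a plain `aov()` fit is a reasonable cross-check for the value printed on the plot.

```r
# Hypothetical energy-use values for three continents in a single year.
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), each = 4),
  energy = c(300, 450, 500, 620, 700, 950, 1200, 1500, 2500, 3100, 3600, 4200)
)

# Mean and standard error of the mean per continent.
means <- tapply(toy$energy, toy$continent, mean)
sems  <- tapply(toy$energy, toy$continent, function(x) sd(x) / sqrt(length(x)))

# One-way ANOVA: does mean energy use differ between continents?
fit <- aov(energy ~ continent, data = toy)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

print(means)
print(sems)
print(p_value)
```

To check a real panel, `toy` would be replaced by the rows of `gapminder` for one year after dropping missing values.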

Question 7

We next focus on imports of goods and services in Europe and Asia after 1990. In none of the post-1990 years in the dataset did the mean import share of GDP differ significantly between Europe and Asia. This suggests that, on average, countries on both continents spent a similar share of their GDP on imports.

gapminder %>%
  filter(Year > 1990) %>%
  filter(continent %in% c("Europe", "Asia")) %>%
  drop_na(`Imports of goods and services (% of GDP)`) %>%
  group_by(Year, continent) %>%
  mutate(Avg_Imports = mean(`Imports of goods and services (% of GDP)`)) %>%
  mutate(SEM_Imports = sd(`Imports of goods and services (% of GDP)`)/sqrt(n())) %>%
  ggplot(mapping = aes(x = continent, y = `Imports of goods and services (% of GDP)`)) +
  geom_bar(stat = "summary", fun = "mean") +
  facet_wrap(~ Year, scales = "free") +
  geom_errorbar(aes(ymin = Avg_Imports - SEM_Imports, ymax = Avg_Imports + SEM_Imports)) +
  stat_compare_means(method = "t.test", label = "p.signif", comparisons = list(c("Asia", "Europe")))
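
For a single year, the comparison shown on the plot corresponds to a two-sample t-test between the two continents. A minimal base R sketch on invented values follows; `toy` and its numbers are hypothetical, and `t.test()` is used with its default Welch correction for unequal variances.

```r
# Hypothetical import shares (% of GDP) for one year; values are invented.
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 5),
  imports = c(28, 35, 41, 52, 60, 30, 38, 44, 47, 55)
)

# Welch two-sample t-test (t.test's default, var.equal = FALSE).
res <- t.test(imports ~ continent, data = toy)

print(res$estimate)  # group means
print(res$p.value)   # a large p-value means no significant difference
```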

Question 8

The next analysis concerns population density. Specifically, we ask which country had the highest population density across the years. To answer this, we rank the countries by population density in each year for which we have data, and then calculate each country’s average rank. The resulting table shows that Macao and Monaco are tied for the best average rank (1.5), making them the most densely populated by this measure.

gapminder %>%
  drop_na(`Population density (people per sq. km of land area)`) %>%
  select(`Population density (people per sq. km of land area)`, Year, `Country Name`) %>%
  group_by(Year) %>%
  mutate(rank = min_rank(desc(`Population density (people per sq. km of land area)`))) %>%
  group_by(`Country Name`) %>%
  summarize(avg_rank = mean(rank)) %>%
  arrange(avg_rank)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 262 x 2
##    `Country Name`            avg_rank
##    <chr>                        <dbl>
##  1 Macao SAR, China               1.5
##  2 Monaco                         1.5
##  3 Hong Kong SAR, China           3.1
##  4 Singapore                      3.9
##  5 Gibraltar                      5  
##  6 Bermuda                        6.2
##  7 Malta                          7  
##  8 Bangladesh                     9.2
##  9 Channel Islands                9.4
## 10 Sint Maarten (Dutch part)     10.5
## # ... with 252 more rows
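
A tie in average rank can arise when two countries swap the top positions between years. The toy example below illustrates this in base R, using `rank(..., ties.method = "min")` on negated values as a stand-in for `min_rank(desc(x))`; the countries "A", "B", "C" and their densities are invented.

```r
# Hypothetical densities for three places in two years; values are invented.
toy <- data.frame(
  Year    = rep(c(1962, 1967), each = 3),
  country = rep(c("A", "B", "C"), times = 2),
  density = c(900, 1200, 50, 1500, 1300, 60)
)

# Rank within each year, densest first; ties get the smallest rank,
# which mirrors dplyr's min_rank(desc(x)).
toy$rank <- ave(-toy$density, toy$Year,
                FUN = function(x) rank(x, ties.method = "min"))

# Average rank per country across years, best (lowest) first.
avg_rank <- sort(tapply(toy$rank, toy$country, mean))
print(avg_rank)
```

Here "A" and "B" each rank first in one year and second in the other, so both end up with an average rank of 1.5, while "C" is third in both years.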

Question 9

Lastly, we would like to know which country saw its life expectancy increase the most from 1962 to 2007. To do this, we drop rows with missing life expectancy, spread the years into columns, and subtract the 1962 value from the 2007 value. Countries without data for either year get a missing difference and sort to the bottom of the table. The result shows that the Maldives saw the greatest improvement, with life expectancy rising by almost 37 years between 1962 and 2007.

gapminder %>%
  drop_na(`Life expectancy at birth, total (years)`) %>%
  select(`Country Name`, Year, `Life expectancy at birth, total (years)`) %>%
  pivot_wider(names_from = Year, values_from = `Life expectancy at birth, total (years)`) %>%
  mutate(`Increase in life expectancy from 1962 to 2007` = `2007` - `1962`) %>%
  select(`Country Name`, `1962`, `2007`, `Increase in life expectancy from 1962 to 2007`) %>%
  arrange(desc(`Increase in life expectancy from 1962 to 2007`))
## # A tibble: 252 x 4
##    `Country Name`     `1962` `2007` `Increase in life expectancy from 1962 to 2~
##    <chr>               <dbl>  <dbl>                                        <dbl>
##  1 Maldives             38.5   75.4                                         36.9
##  2 Bhutan               33.1   66.3                                         33.2
##  3 Timor-Leste          34.7   65.8                                         31.1
##  4 Tunisia              43.3   74.2                                         30.9
##  5 Oman                 44.3   75.1                                         30.8
##  6 Nepal                36.0   66.6                                         30.6
##  7 China                44.4   74.3                                         29.9
##  8 Yemen, Rep.          34.7   62.0                                         27.2
##  9 Saudi Arabia         46.7   73.3                                         26.7
## 10 Iran, Islamic Rep.   46.1   72.7                                         26.6
## # ... with 242 more rows
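
If countries lacking a value for either year should be excluded outright rather than left with a missing difference, an inner join of the two years is one option. The following base R sketch shows the idea on invented data; `toy`, the column name `life_exp`, and the values are all hypothetical.

```r
# Hypothetical life expectancies; country "C" has no 1962 value.
toy <- data.frame(
  country  = c("A", "A", "B", "B", "C"),
  Year     = c(1962, 2007, 1962, 2007, 2007),
  life_exp = c(40, 70, 55, 75, 68)
)

start <- toy[toy$Year == 1962, c("country", "life_exp")]
end   <- toy[toy$Year == 2007, c("country", "life_exp")]

# Inner join keeps only countries present in both years.
both <- merge(start, end, by = "country", suffixes = c("_1962", "_2007"))
both$increase <- both$life_exp_2007 - both$life_exp_1962

# Largest gain first.
both <- both[order(-both$increase), ]
print(both)
```

Country "C" drops out of the result because it has no 1962 row to join against.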